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Abstract 

High-confidence prediction of complex traits such as disease risk or drug response is an ultimate goal 
of personalized medicine. Although genome-wide association studies have discovered thousands of well- 
replicated polymorphisms associated with a broad spectrum of complex traits, the combined predictive 
power of these associations for any given trait is generally too low to be of clinical relevance. We propose a 
novel systems approach to complex trait prediction, which leverages similarity in genetic, transcriptomic 
or other omics-level data. We translate the omic similarity into phenotypic similarity using a method 
called Kriging, commonly used in geostatistics and machine learning. Our method called OmicKriging 
emphasizes the use of a wide variety of systems-level data, such as those increasingly made available by 
comprehensive surveys of the genome, transcriptome and epigenome, for complex trait prediction. The 
approach facilitates exploration of the etiology of disease risk or drug response by quantifying specific 
omic contributions to prediction. Our method is a fast, simple and flexible approach to polygenic, 
and more generally, poly-omic, prediction. Using seven diseases from the Wellcome Trust Case Control 
Consortium (WTCCC), we show that our method yields performance similar to more computationally 
intensive methods when restricted to genotypic data. Using a cellular growth phenotype, we show that 
integrating mRNA and microRNA data with genotypic data substantially increases performance. We 
provide guidelines on how to choose the similarity matrices and an R package to implement OmicKriging 
(http: //www. scandb . org/newinterf ace/t ools/Omi cKriging . html[ ) . 
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Introduction 

High-confidence prediction of complex traits is an ultimate goal of personalized medicine. For heritable 
traits, genotypic similarity contributes to phenotypic similarity. While genome-wide association studies 
(GWAS) have revealed many statistically significant loci associated with complex traits, the small effect 
sizes of most loci limit their prediction utility 1 . For most complex traits, an appreciable proportion 
of phenotypic variance is explained only when polygenic models, rather than single-marker tests, are 
used [2][3]. For example, 45% of the variance in human height can be explained using a mixed linear 
modeling approach (GCTA) that simultaneously considers all 300K common SNPs genotyped [3j|4j- 
Because this approach does not rely on selection of loci, it is thought to be appropriate for traits with a 
highly polygenic or infinitesimal genetic architecture [5j[6j- 

In order to predict phenotypes from genomic data, our approach borrows from the field of geostatistics. 
Kriging interpolates the value of some geographical measurement such as yearly rainfall at an unobserved 
location [7jj9]. The approach assumes the rainfall measured at nearby locations will be more similar 
to that at the unmeasured location than rainfall at sites farther away. Kriging uses weights based on 
the correlation matrix of these locations along with the observed rainfall amounts to predict the rainfall 
at the unmeasured location. The locations are analogous to individuals and the rainfall amounts are 
analogous to the phenotype of interest (Figure [TJ. In our application to complex trait prediction, the 
notion of distance can be replaced with that of genetic similarity. Close ties between genetic distance 
and geographic distance have been demonstrated in studies of human population structure. For example, 
in the analysis of genome-wide genotype data from 3000 Europeans, a geographic map of Europe arose 
naturally when the first two principal components of the data were plotted [10| . A diagram of the analogy 
between geostatistical kriging and complex trait prediction is shown in Figure [T] The loose connection 
between kriging and linear regression is illustrated in Figure [2j 

Complex trait prediction using whole-genome data has been previously described [T|[5|[6{ [TT] . We 
propose a novel systems approach to predict complex traits, which leverages similarity in genetic, tran- 
scriptomic, and/or other omics-level data. Here, we describe our OmicKriging method, apply the method 
to several human complex traits (cellular growth and seven WTCCC diseases), and provide an R package 
for implementation. 
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Results 

In implementing OmicKriging, the first step is to construct a similarity matrix or similarity matrices 
for use in the prediction. Such matrices may include a genetic relationship matrix (GRM) from a set of 
SNPs, a gene expression correlation matrix (GXM) from gene expression data, or any other correlation 
matrix derived from omics-level data. Each matrix can be tested individually or in a given weighted 
combination for phenotype predictive performance. The optimal correlation matrix depends on the 
genetic architecture of the phenotype. To predict the phenotype of an individual, the weighted average of 
the remaining individuals' phenotypes is calculated. In the case of SNP data, the weights are comprised of 
the GRM and pairwise genotype correlations of the unknown individual with the genotypes of those with 
observed phenotypes. The inclusion of covariates such as principal components from the GRM to correct 



for population structure or eigengenes calculated from the GXM 12 may also improve the heritability 
estimate and phenotypic prediction. 

In applying OmicKriging, we used a cross-validation approach. We regressed the predicted phenotype 
on the true phenotype to measure performance. Note that in the OmicKriging method, weights are 
computed out-of-sample (only an individual's genotype is used to compute pairwise correlations) and 
should not be compared to reports where the R 2 is computed using parameters estimated from the same 
data (in-sample). In terms of computational time, the prediction of each test set for a sample size of 
about 2500 takes less than 30 seconds. When we divided the full sample into 10 test sets, the prediction 
takes less than five minutes. This first version of the OmicKriging software written in R is not optimized 
for speed and thus, further optimization is possible. 

Cellular phenotype applications 

To assess the predictive performance of OmicKriging, we used the intrinsic growth rate (iGrowth) pheno- 
type derived from multiple proliferation measurements in the commonly used HapMap lymphoblastoid 



cell lines (LCLs) 13 . iGrowth values from 99 LCLs from the HapMap CEU (Northern and Western 
European ancestry from Utah) and YRI (Yoruba from Ibadan, Nigeria) populations were used in the 
analysis. The GRM was generated from 2.7 million common SNPs (minor allele frequency > 0.05) from 



HapMap 14 and the GXM was generated from 13,080 transcript clusters from a previous genome-wide 



gene expression analysis [15|. The first 10 principal components from the genotype data and the first 
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10 eigengenes from the expression data were used as covariates in the OmicKriging prediction. We also 
obtained expression measurements for 201 microRNAs [16 and generated a microRNA expression corre- 
lation matrix (MXM). To determine which combination of correlation matrices was best able to predict 
iGrowth, we weighted the matrices in various combinations and compared the OmicKriging results. Each 
matrix alone was able to significantly predict iGrowth with the GXM producing the largest correlation 
(R 2 =0.38) of the three (Table [TJ Figure l3|) . However, when all three matrices were weighted equally, the 
prediction improved (R 2 =0.40). This indicates that combining omics-level data improved the predictive 
power of OmicKriging over using the GXM alone. When using a larger sample size (n=159) for which we 
had gene expression data, but not microRNA data, we were able to achieve a predictive R 2 =0.51 based 



Clinical phenotype applications 

To test the predictive power of OmicKriging using clinical phenotypes, we turned to the seven diseases 
of the Wellcome Trust Case Control Consortium (WTCCC) [17]. First, for each case/control dataset, all 
genotyped common SNPs (approximately 400,000) were used to generate a GRM. OmicKriging was run 
using the correlations from these GRMs and case/control prediction was successful for all seven diseases 



(Table [2]). In an attempt to further improve the prediction, we generated a second GRM for each disease 

using the SNPs within lOOkb of known large effect SNPs (identified outside of WTCCC studies) listed in 
the The National Human Genome Research Institute GWAS catalog uM and he Database of Genotypes 
and Phenotypes (dbGAP) ^9\. Equally weighting the first GRM of all SNPs and the second GRM of 
known loci improved the predictive power of OmicKriging over using just the first GRM alone for Crohn's 
disease, rheumatoid arthritis, type 1 diabetes and type 2 diabetes (Table [2] Figure [1]). The type 1 di- 
abetes prediction showed the largest improvement when the second GRM was added: the R 2 increased 
from 0.13 to 0.42 and the area under the ROC curve (AUC) increased from 0.71 to 0.89 (Table[2) Figure|I]). 





Discussion 



We propose a novel systems approach to predict complex traits, which leverages similarity in genetic, 
transcriptomic and/or other omics-level data. We translate the genomic similarity into phenotypic sim- 
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ilarity using a method called Kriging, commonly used in geostatistics and machine learning. Here, we 
construct the genomic similarity using a linear combination of omic correlation matrices but more general 
similarity matrices can be used. Given the similarity matrix, the prediction is obtained by simply com- 
puting a weighted average of the phenotype of individuals in the training set. The weights are provided by 
the Kriging method, which can be loosely interpreted as a converse of the linear regression method. Our 
method is a fast, simple and flexible approach to polygenic, and more generally poly-omic, prediction. 

The key component of the method is the choice of the similarity matrix. This can be estimated using 
likelihood-based methods (such as implemented in GCTA [4]) or Bayesian methods (such as BSLMM [ll]). 
We can also take a more pragmatic approach and test multiple weights for each correlation matrix 
component (large effect GRM, small effect GRM, GXM, etc.) to determine those that maximize prediction 
performance. When testing the performance of our method, we used our suggested starting values for 
the weights for the similarity matrix components (Table pjj), and no optimization was done. Thus, our 
reported performance measures are conservative. 

Using genotypic, mRNA and microRNA data in HapMap lymphoblastoid cell lines, we predict the 
cellular intrinsic growth phenotype with a striking out-of-sample R 2 of 40% (51% when n=159, for which 
only genotypic and mRNA expression data were available). We show here that the gene expression 
correlations are better than the SNP correlations at predicting the growth phenotype and that adding 
microRNA expression correlations slightly improves the GXM-only prediction (Table [l}. Given that 30% 
of gene expression levels associate with the intrinsic growth phenotype (FDR<0.1) [13], it is perhaps not 
surprising that gene expression correlations are most useful in predicting this phenotype. 

We also successfully predicted seven clinical phenotypes with OmicKriging. We show that our method 
yields performance similar to other computationally more intensive methods when restricted to genotypic 
data (Figure [4]). For the known autoimmune diseases (Crohn's disease, rheumatoid arthritis, type 1 
diabetes) , adding a second GRM of known loci increased the prediction AUC values over a common SNP 
GRM alone to the AUC values obtained by BSLMM [II] (Figure [4]). These autoimmune diseases are 
known to have multiple associated loci of relatively strong effects, so this prior knowledge was used to 



improve prediction performance 20 . Unless replicated in an independent study, the results from the 
WTCCC data were not used to select top SNPs to avoid ovcrfitting the data. 

We provide guidelines on how to choose the similarity matrices and an R package to implement the 



kriging algorithm called OmicKriging (http://www.sccLndb.org/newinterface/tools/DmicKriging. 



html). 
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Methods 

Kriging Formula 

Within the kriging framework, the predicted phenotype of a test individual is computed as the weighted 
average of the phenotype of the individuals in the training set. 

Prediction(Y new ) = w{Yi + lj 2 Y 2 H hw„7„ (1) 

where the weights oji are a function of all n(n + l)/2 pairs of correlations. In the simplest case where no 
covariates are needed, the weights prescribed by the kriging method are given by 

w = xrV (2) 

where p is the correlation vector between the test individual and the training individuals and X is the 
correlation matrix of the individuals in the training set [7] . Covariates are easily included in the method 
by using the so called universal kriging approach V7\ . Assuming there are p covariates (if only the intercept 
is considered, p = 1), let z be the p by 1 vector with covariates 1 to p corresponding to the test individual 
and Z be the n by p matrix with the p covariates for the n individuals in the training set. The weights 
become 

u = £- 1 {p + Zm) (3) 

where m = (Z'S"^)" 1 ^ - UYr x p). 
Optimal similarity measure 

A key component of the success of OmicKriging is understanding which proximity measures translate 
best into phenotypic similarity. The optimal correlation matrix depends on the underlying genetic and 
epigenetic architecture of the complex trait. 
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Under a general additive poly-omic model, the phenotype for individual i can be written as 

Y l =a + G i + T l + O i + ■■■+€, 

where 

• a is a constant, 

• d = J2i P? x ? i s the- additive genetic component and fif is the effect size of the standardized 
genotype Xf, 

• Ti ~J2i &i x T and j3f is the effect size of the standardized gene expression Xf, 

• Oi —J2i P? is the additive (other) omic component and fif is the effect size of the standardized 
omic level Xf , 

• and ej is a noise term. 

We assume a random effects model for the j3's and the covariances between /3's of different omic origins. 
For simplicity, we also assume that /3's from the same omic origin are independent. 
With these assumptions, the optimal correlation matrix has the form 

= G ^ X ik X jk + ^ X Ik X Jk + 60 ^ X ik X jk + ^eSij 
k k k 

+ (>GT X fk X Jl + X! X ik X fl) 

kl k 

+ (>GO X fk X fl + ^2 X °k X fl) 

kl k 

+ Oto X ik X jl + X! X ik X jl) 

kl k 

where Sij is the kronecker delta (1 if i = j and otherwise) and the weights 9 X can be (at least theoreti- 
cally) estimated using likelihood-based or Bayesian approaches. 

In case prior expression quantitative trait loci (eQTL) information is available, it may be possible 
to restrict the cross-correlation terms to known eQTL pairs (X k ,Ti). However, care must be taken to 
preserve positive definitcness of T (all eigenvalues must be > 0). In this paper, to keep computations 
within reach, we ignore the cross-correlation terms. 
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Genotype-only analysis 
LMM model 



When only genotypic data are available, the following model is commonly assumed [3] 

M 



fc=i 



where the effect sizes /3fe's are assumed to be normally distributed N(0,cr|), is the n by 1 vector of 
standardized genotypic dosage for marker fc, and M is the number of genetic markers. In this case, the 
optimal correlation matrix is given by the genetic rclatedness matrix X'X, where X is the matrix with 



genotypic data for all n individuals. Following the Zhou et al. 11 terminology we call this model LMM 



BSLMM model 

If the trait has a polygenic background with a handful of known large effect loci, we can leverage that 
information by using two genetic relatedness matrices and assign comparable weight to each component. 
This model can be represented as 

M M 

fe=i fe=i 

where f3k, X, and M are as defined in the LMM model, /3% is zero with probability 1 — it and normally 
distributed N(0, er^) with probability it, and < a\. Again, following the Zhou et al. fllj terminology 
we call this model BSLMM. In this case, the optimal correlation matrix has the form #iX'X + ^X^X^ 
where X is the matrix with all genotypic data and X^ is the matrix of the subset of markers with large 
effects. We do not attempt to estimate from the data which markers have large effect sizes but use mark- 
ers published in the NHGRI catalog and the ones within lOOkb of them to build a large effect genetic 
relatedness matrix. When selecting the large effect variants, it is crucial not to use results from the 
dataset that is being predicted in order to avoid overfitting the data. 
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Tables 

Table 1. OmicKriging results for iGrowth prediction using various correlation matrices 
and weights. 



GRM weight 


GXM weight 


MXM weight 


Covariatcs 


n 


R 2 


P-valuc 


1 








PCs,EGs 


99 


0.31 


1.6xl0" 9 





1 





PCs,EGs 


99 


0.38 


l.OxlO" 11 








1 


PCs,EGs 


99 


0.27 


3.0xl0" 8 


0.33 


0.33 


0.33 


PCs,EGs 


99 


0.40 


2.4xl0" 12 


0.5 


0.5 


NA 


PCs,EGs 


159 


0.33 


2.7xl0" 15 


1 





NA 


PCs,EGs 


159 


0.29 


2.5xl0~ 13 





1 


NA 


PCs,EGs 


159 


0.51 


4.7xKT 2B 



GRM=gcnetic relationship matrix, GXM=genc expression matrix, MXM=microRNA expression 
matrix, PCs=principal components, EGs=eigengenes 
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Table 2. OmicKriging results for complex disease prediction. 



Disease 


GRM1 
(weight) 


GRM2 
(weight) 


Cases 


Controls 


R 2 


P-value 


ROC 
AUC 


Bipolar disorder 


all SNPs 
(1) 


NA 


2938 


1868 


0.074 


7xl0~ 82 


0.657 


Bipolar disorder 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1868 


0.049 


7xl0~ 55 


0.630 


Coronary artery disease 


all SNPs 
(1) 


NA 


2938 


1926 


0.030 


3xl0- 34 


0.598 


Coronary artery disease 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1926 


0.034 


5xl0- 38 


0.603 


Crohn's disease 


all SNPs 
(1) 


NA 


2938 


1748 


0.057 


5xl0~ 62 


0.637 


Crohn's disease 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1748 


0.10 


3xl0~ i14 


0.690 


Hypertension 


all SNPs 
(1) 


NA 


2938 
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0.036 


3xl0~ 40 


0.610 


Hypertension 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1952 


0.032 


lxlO- 36 


0.604 


Rheumatoid arthritis 


all SNPs 
(1) 


NA 


2938 


1860 


0.070 


4xl0~ 77 


0.654 


Rheumatoid arthritis 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1860 


0.158 


5xKr 181 


0.734 


Type 1 diabetes 


all SNPs 
(1) 


NA 


2938 


1963 


0.126 


4xl0~ 145 


0.710 


Type 1 diabetes 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1963 


0.416 


<lxl0" 2UU 


0.888 


Type 2 diabetes 


all SNPs 
(1) 


NA 


2938 


1924 


0.040 


lxlO- 44 


0.617 


Type 2 diabetes 


all SNPs 
(0.5) 


known 
loci (0.5) 


2938 


1924 


0.055 


3xl0~ 62 


0.635 



GRM=genetic relationship matrix, ROC AUC=receiver operating characteristic area under the curve 



Table 3. Suggested initial weights for similarity matrices. These are suggested starting values 
for the weights of each component similarity matrix. They can be modified in order to maximize 
out-of-sample prediction. See the Methods section for term definitions. 
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Y t = a + H 












Gi 


i 








LMM 


Gt + Gi 


0.5 


0.5 






BSLMM 
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LMM+gene expression 


G l + T l + O l 


0.33 




0.33 


0.33 


LMM+gcne expression+other 
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Physical variable 
(ex. rainfall) 




Complex trait 
(ex. height) 


M ► 




OtC 




OtO 





Closer locations get 
larger weights 



More related individuals 
get larger weights 



Locations 




>■ ■< 
Individuals 

<. * 


M ► 


Physical proximity 

S 4 


/ * 

Genetic proximity 


M ► 



Physical similarity 



Phenotypic similarity 



Figure 1. Kriging and whole-genome prediction connection. This figure shows the analogous 
relationships between components of the kriging method used in geostatistics and whole-genome 
prediction. The prediction at an unobserved location (?) is computed as a weighted average of the 
variable at observed locations. The weights are functions of the correlation between the rainfall at the 
new location and the rainfalls at the observed locations. The closer the distance between each observed 
location and the new location, the higher the weight. In complex trait prediction, locations correspond 
to individuals, physical proximity corresponds to genetic relatedness. The correlation between two 
locations or individuals is the key component of this method. In animal breeding approaches, the 
genetic relatedness matrix or kinship matrix is used. In OmicKriging, a genetic relatedness matrix, a 
gene expression similarity matrix, or any combination of available high-throughput data similarity 
measures can be tested for complex trait prediction performance. 
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Figure 2. Kriging and linear regression connection. In linear regression, we have many 
observations of Y and predictors and estimate the regression coefficients (weights) as the inverse of the 
sample covariance matrix of the observed values (covariates) times the sample covariance between the 
outcome Y and the covariates. In contrast, in kriging, we use weights (estimated from other means) to 
predict Y. Notice the weights in both cases have the same functional form. 
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Figure 3. iGrowth prediction using OmicKriging. Predicted versus true iGrowth (n=99) using 
the genetic relationship matrix (GRM) alone, the microRNA expression matrix (MXM) alone, the gene 
expression matrix (GXM) alone, and all three matrices weighted equally. Among these four weightings, 
the combination of all three omics matrices predicted iGrowth best, which indicates that the addition of 
the GRM and MXM improved the prediction over using the GXM alone. The solid black line represents 
the slope of the regression between the predicted and true values. The red dashed line is the identity 
line (slope 1, intercept 0). 
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Figure 4. OmicKriging prediction performs similarly to BSLMM. The area under the ROC 
curve (AUC) is plotted for three prediction methods for each disease from the WTCCC. The Bayesian 
Sparse Linear Mixed Model (BSLMM) results are from Zhou et al. 11 . The OmicKriging results use a 
single common SNP genetic relationship matrix (1GRM) and two equally weighted GRMs of common 
SNPs and known loci (2GRM) for the predictions. The 1GRM results are similar to BSLMM for BD, 
CAD, HT, and T2D. The 2GRM results are similar to BSLMM for CD, RA, and T1D, which are known 
to have multiple associated loci of relatively large effect. The known loci were obtained from studies 
that did not include the WTCCC data to avoid over-fitting. BD=bipolar disorder, CAD=coronary 
artery disease, CD=Crohn's disease, HT=hypertension, RA=rheumatoid arthritis, TlD=type 1 
diabetes, T2D=type 2 diabetes. 



